For this project I downloaded and cleaned the data of more than 2500 listings of used Ford Focus cars from the biggest Hungarian online used car listing site, Hasznaltauto.hu. I chose specificaly the Ford Focus for this task because:
So I hope the dataset is diverse enough to offer some interesting insights. My major questions are quite obviuous:
I included the Python scripts for scraping and cleaning the data in the data-scraper directory. I used pandas, requests, BeautifulSoup and pymongo for MongoDB as an itermediary data store.
The main steps were:
(Note: The data was obtained on July, 27 and was not updated since then. This sample is only a tiny fraction of the whole database of Hasznaltauto.hu, and not intended for any purposes other than this learning project. I am not affiliated to Hasznaltauto.hu in any form.)
## 'data.frame': 2552 obs. of 26 variables:
## $ Doors : num 5 5 5 5 5 5 4 5 5 5 ...
## $ C.Max_variant : Factor w/ 2 levels "False","True": 1 1 2 1 1 1 1 1 1 1 ...
## $ Trunk.capacity.liters: num 520 385 NA NA NA NA 490 520 482 482 ...
## $ Drive : Factor w/ 2 levels "All-wheel","Front-wheel": 2 2 2 2 2 2 2 2 2 2 ...
## $ Capacity : num 1596 1560 1560 1796 1753 ...
## $ Mileage.km : num 192000 205000 210599 182146 175117 ...
## $ Form : Factor w/ 7 levels "Cabrio","Combi",..: 2 3 4 2 2 3 6 2 2 2 ...
## $ A.C.type : Factor w/ 3 levels "Automatic A/C",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Documents.valid : Date, format: NA "2018-01-31" ...
## $ Own.weight.kg : num 1135 1352 NA NA NA ...
## $ Horsepower : num 101 90 109 116 116 90 101 101 101 101 ...
## $ Condition : Factor w/ 14 levels "Damaged","Damaged front",..: 10 10 10 10 10 10 12 11 10 10 ...
## $ Price.EUR : int 2678 2453 2714 2780 2780 2780 2816 2910 2940 2944 ...
## $ Total.weight.kg : num 1665 1790 NA NA NA ...
## $ Fuel : Factor w/ 5 levels "Diesel","Electric",..: 4 1 1 4 1 1 4 4 4 4 ...
## $ Manufactured.year : int 2003 2005 2003 2002 2007 2005 2004 2003 2007 2003 ...
## $ Manufactured.month : int NA NA NA NA NA NA NA NA NA NA ...
## $ Transmission.type : Factor w/ 2 levels "Automatic","Manual": 1 2 2 2 2 2 2 2 2 2 ...
## $ Gears : num 4 5 5 5 5 5 5 5 5 5 ...
## $ Documents : Factor w/ 7 levels "No documents",..: 7 6 6 6 6 6 6 6 6 6 ...
## $ Paint : Factor w/ 19 levels "Beige","Black",..: NA 3 7 15 15 18 5 15 18 NA ...
## $ Shade : Factor w/ 3 levels "Dark","Light",..: NA 1 3 3 3 3 3 3 3 NA ...
## $ Metallic : Factor w/ 2 levels "False","True": NA 1 1 1 1 1 2 2 1 NA ...
## $ Undamaged : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ Age.year : num 14 12 14 15 10 12 13 14 10 14 ...
## $ Mileage.year : num 13714 17083 15043 12143 17512 ...
## Doors C.Max_variant Trunk.capacity.liters Drive
## Min. :2.000 False:2335 Min. :277 All-wheel : 15
## 1st Qu.:5.000 True : 217 1st Qu.:385 Front-wheel:2515
## Median :5.000 Median :476 NA's : 22
## Mean :4.823 Mean :436
## 3rd Qu.:5.000 3rd Qu.:482
## Max. :5.000 Max. :534
## NA's :20 NA's :106
## Capacity Mileage.km Form A.C.type
## Min. : 138 Min. : 1 Cabrio : 18 Automatic A/C: 150
## 1st Qu.:1560 1st Qu.:121660 Combi :1144 Digital A/C : 856
## Median :1596 Median :169662 Hatchback: 920 Manual A/C :1447
## Mean :1613 Mean :163311 Minivan : 217 NA's : 99
## 3rd Qu.:1753 3rd Qu.:211581 Notchback: 9
## Max. :2522 Max. :461000 Sedan : 242
## NA's :9 NA's :71 Sport : 2
## Documents.valid Own.weight.kg Horsepower Condition
## Min. :2015-07-31 Min. :1050 Min. : 75.0 Normal :1082
## 1st Qu.:2018-03-31 1st Qu.:1220 1st Qu.: 95.0 Excellent: 658
## Median :2018-10-31 Median :1307 Median :109.0 Spared : 534
## Mean :2018-10-25 Mean :1295 Mean :109.2 Recent : 114
## 3rd Qu.:2019-05-31 3rd Qu.:1386 3rd Qu.:116.0 Undamaged: 89
## Max. :2022-05-31 Max. :1760 Max. :351.0 Damaged : 30
## NA's :586 NA's :82 NA's :5 (Other) : 45
## Price.EUR Total.weight.kg Fuel Manufactured.year
## Min. : 324 Min. : 187 Diesel :1357 Min. :1995
## 1st Qu.: 2613 1st Qu.:1720 Electric : 5 1st Qu.:2004
## Median : 4248 Median :1825 Ethanol : 3 Median :2007
## Mean : 6000 Mean :1800 Gasoline :1171 Mean :2008
## 3rd Qu.: 7822 3rd Qu.:1900 Gasoline/LPG: 15 3rd Qu.:2012
## Max. :46969 Max. :2180 NA's : 1 Max. :2017
## NA's :83
## Manufactured.month Transmission.type Gears
## Min. :10.00 Automatic: 72 Min. :4.00
## 1st Qu.:10.00 Manual :2468 1st Qu.:5.00
## Median :11.00 NA's : 12 Median :5.00
## Mean :10.99 Mean :5.22
## 3rd Qu.:12.00 3rd Qu.:5.00
## Max. :12.00 Max. :7.00
## NA's :2026 NA's :42
## Documents Paint Shade Metallic
## No documents : 150 Silver :434 Dark : 316 False: 836
## Overdue, domestic : 39 Blue :420 Light : 149 True :1233
## Overdue, foreign : 57 Gray :388 Regular:1604 NA's : 483
## Temp.whitdrawn, domestic: 59 Black :292 NA's : 483
## Temp.whitdrawn, foreign : 32 White :241
## Valid, domestic :2121 (Other):294
## Valid, foreign : 94 NA's :483
## Undamaged Age.year Mileage.year
## Mode :logical Min. : 0.000 Min. : 0.056
## FALSE:75 1st Qu.: 5.000 1st Qu.:12984.615
## TRUE :2477 Median :10.000 Median :16785.714
## Mean : 9.495 Mean : Inf
## 3rd Qu.:13.000 3rd Qu.:25331.125
## Max. :22.000 Max. : Inf
## NA's :71
Our main target variable in the analysis is the price. The distribution is positively skewed, the long tail reaches up to well over EUR 40 000, while the peak of the distribution is around only EUR 3000, with some observation as low as few hundred euros (or even lower). There’s also a little secondary peak at around EUR 14 000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 324 2613 4248 6000 7822 46969
The very low price points might belong to damaged cars. The dataset contains descriptions for the condition of the cars, from which we can construct a boolean variable for indicating whether according to the seller the car is not damaged (eg. the condition is either ‘Excellent’, ‘Normal’, ‘Undamaged’, ‘Spared’ or ‘Recent’).
From 2552 observation only 75 is labeled as damaged, and majority of it is indeed offered for sale under EUR 1000. It might be useful to filter these out later.
Transforming the long-tail data (excluding damaged cars) reveals a more (but not completely) symmetrical distribution.
The other important continous variable is the number of kilometers each car run in their lifetime. The shape of the distribution (excluding damaged cars) is fairly symmetrical, except the huge bar around 0.
This is due to cars in the dataset that are either new (some dealerships do advertise new cars on the site) or incorrectly stated as having run 1 or very few kilometers.
I consider a car ‘unusued’ if it ran less than 1000km. To remove these and the damaged cars we can construct a subset of the car_data dataframe called undamaged_used_cars.
Using this subset mileage shows a prettier distribution: it’s slightly leptokurtic, but seems to be fairly close to normal distribution as the dashed curve shows.
From Manufactured.year column we can construct the Age.year variable, that is a reasonable proxy to the real age of the car (it might miss the real age by maximum about half year in either direction as the data was downloaded slightly after mid-2017).
Unfortunatelly only a small number of observations contain the month as well, so there’s no way of defining age more precisely.
Count of values of Manufactured.month:
## 10 11 12 NA's
## 188 156 182 2026
The distribution is slightly negatively skewed with a noticable peak at the mode of 13 years and a smaller secondary peak around 5-6 years. The median is 10 years (slightly above the mean of 9.5 years), with the quartiles being 5 and 13 years.
Plotting the density of theoretical normal distribution with the same mean and standard error reveals the slight differences.
A potential metric for how intensively was a car used before can be the milage per year (meaning how many kilometers did the car travel on average per year during its lifetime). The higher the number, the more active was the usage.
This calculation have to exclude cars made in 2017 as their age in years is 0, therefore I updated the undamaged_unused_cars subset with a condition of Age.year > 0 (this is not disruptive change as it filters out only 8 observations on top of the existing conditions).
This constructed variable is again positively skewed, the tail reaches above 7500km/year while the mode is around 13-14 000km/year. With a log10() transformation the distribution gets closer to normal.
There seven different available format in the dataset, of which combi and hatchback are the two most frequent, while the cabrios, notchbacks and sports are quite rare.
##
## Cabrio Combi Hatchback Minivan Notchback Sedan Sport
## 18 1025 727 210 6 223 2
In the dataset manual transmission and manual air-conditioning is by far the dominant, just as 5-door variant over 3 or 4-doors types.
In terms of fuel the two common options (gasoline and diesel) dominate over mixed and alternative fuels (ethanol and electric drive). For each plot I excluded observations where the data was missing.
Some of the cars does not have proper legal papers for domestic usage (eg, they are imported or damaged, etc.). Most of the cars have proper valid documents but missing them could be a negative factor in pricing for the rest.
##
## No documents Overdue, domestic Overdue, foreign
## 2 35 43
## Temp.whitdrawn, domestic Temp.whitdrawn, foreign Valid, domestic
## 45 27 1978
## Valid, foreign
## 81
In 2069 cases we have data on the paint of the cars, not just the color but in some cases also on the shade (light/dark) and whether the car has a metallic or regular polish.
The dataset contains 23 variables of 2552 used cars. The continous variables are Manufactured.year and Mileage.km, while other numerical columns act more like ordinal variables (eg. engine capacity, no. of doors, no. of gears, weight, etc.) as these are determined by the variant of the model.
We also have a handful of interesting categorical variables, like the color and shade of the cars, the form type (‘combi’, ‘hatchback’, etc.).
The categorical variables somewhat contradict my initial assumption that the higher number of possible variations of the model might create a diverse dataset. Many of the categorical columns (eg. color, transmission type, fuel, etc.) are heavily concentrated around one or few typical values, though ‘exotic’ variants are also present in the data.
The most popular colors are silver, blue and gray, but if you think of silver as one kind of gray, gray is the most frequent color on the market by far. On the other end of the scale there are handful of ‘exotic’ colors, like champagne, gold and magenta.
The price is our target variable. The distribution is not surprising, shows the natural traits of a monetary type variable, that can be treated by using log10 transformation on it. That brings it closer to a normal(ish) distribution and hopefully it helps reveal clear relationships with other variables.
There are also two interesting continous variables that might play important role indetermining the price of the car: age and mileage. None of these show extraordinary distributions, though they are not perfectly normal either, but the most common transformation do not provide meaningful benefits in these cases.
Besides age and mileage I expect some of the technical and aesthetical features to have some impact and to show relationship with the other variables, namely:
In a linear model we can use these via ‘dummy’ variables to gauge whether differing from the most common values have significant impact on pricing.
I derived the age (counted in years) from the year when the car was manufactured. This does not alter the meaning of the data much, but it’s more reasonable to use it when jitter is added (as age can have fractional values, while the year of manufacturing can’t), even though it’s just a proxy for the real age.
I created a variable for the average annual mileage (Mileage.km/Age.year) that might help explaining the difference in price of cars with the same age. It shows that how actively the car was used during its lifetime.
The initial look at prices and age surfaced two issues with the data:
1.it contains observations of damaged cars, that might have completely different profile than the undamaged ones 2. it contains cars that have only 1 or very few kilometers registered, so these are either new or mislabeled entries.
I removed both groups in the undamaged_used_cars subset without dramatically reducing the size of the sample. By removing these factors we can focus on the other important features in the dataset.
## Total number of cars: 2552
## of which undamaged and used: 2211
Quick glance on pairplots of the continous variables, using the log of Price.EUR and not transforming the others, we see a very high negative correlation with Age (-0.9) and it materializes in a clear pattern on the point plot also.
Total mileage (Mileage.km) also relatively high and even the constructed average annual mileage (Mileage.year) has detectable correlation with price. This correlation of Mileage.yearis lower when I use the log-transformed Mileage.year.
Plotting mileage and the log-transformed price for all cars (including damaged and unused ones) reveal a negative relationship between the two variables, supporting the intuitive idea that a car that has run more kilometers should be cheaper.
On the other hand the pattern is quite dispersed, the connection between the two variables does not seem to be strong (the correlation was 0.48), even when setting lower alpha to handle overplotting.
When we use only the undamaged, used cars from of the dataset the picture is better, but the relationship is still not very strong in this form.
As age was calculated from the Manufactured.year column, it has discrete values. This and the overplotting makes it hard to assess the relationship (we already know that the correlation is strong).
Adding some jitter and alpha, the picture gets clearer, though there are some seemingly outlier points in both direction, and the plot seems too dense around 0 kilometer.
If we ‘paint’ barely used or damaged cars (red), some of the outliers and much of the crowdedness around 0 pops out. It reinforces the earlier idea to exclude these cars from the analysis.
It seems we find a rather strong relationship between the age and the price of a car, which is not surprising, but good to see it proved.
Plotting mileage against age we get a rather dispersed pattern. The median mileage in the different age cohorts does not change in a linear way with age, the median mileage is roughly equal in ages between 8 and 14 years, while it increases substantiall from age 1 to 8.
It seems that the younger cars are more diverse in their annual average mileage (points are spread in a wider range). This might be due to the relatively lower number of observations in lower age cohorts where a few extreme (maybe erroneous) overall mileage value can affect the yearly average more in both directions.
In the higher age segments the distribution seems more consistent, the inter- quartile width is quite stable, while all the quartiles seem to follow a rather linear negative line.
Plotting the annual average mileage and the price (both on log-scales) does not reveal a very specific direct relationship.
There is some sort of relationship between the type of fuel the car uses and the age or the price, though above we already saw that alternative fuel types (not diesel or gasoline) are quite rare in the dataset.
Also cars with automatic transmission tend to be somewhat younger and a little pricier, based on the boxplots (though the mean age and mean price is roughly the same in both groups).
Engine capacity vs. Horsepower produces a great deal of overplotting. Capacity here acts like an ordinal feature, therefore the points form horizontal lines with a few outliers (138 cm3 is clearly not true, while 350hp might be the result of some tuning).
2D bins help to avoid overplotting and also nicely packs the slightly different engine capacity values into bins. There’s clearly a positive relationship between engince capacity and the power (not surprisingly).
Horsepower also seem to have a positive relatioship with price, though the ordinal nature of horsepower makes it hard to visualy validate this assumption.
Price distribution clearly differs by form variants.
Separating the four major varieties minivans seem to differ to a large extent from the other three distributions in its shape, while sedans seem to have a reasonably lower median price. Minivans are also very different in terms of age from the other major categories.
## # A tibble: 4 x 4
## Form median_price mean_price standard_deviation
## <fctr> <dbl> <dbl> <dbl>
## 1 Combi 4085.0 5027.943 3298.666
## 2 Hatchback 4248.0 5463.437 4803.712
## 3 Minivan 3950.5 4058.290 1018.962
## 4 Sedan 2910.0 3898.260 2494.989
The major finding in this section is that the relationship between age and price seem to be a very strong one, especially when we filter out the damaged or unused cars from the sample. Most likely age is the major factor in the pricing of the cars.
Also the mileage turned out to be much less important than I initially thought at least directly plotted with the price, and in this aspect, the calculated Mileage.year variable did not show important relation to the price either.
In the categorical features cars running on gasoline and with manual transmission tend to be older and somewhat cheaper.
Age and mileage are much less connected than I expected, many cars in the middle of the age range show similar milage and in a single age cohort mileage can vary in a wide range.
Also it turned out that the Minivan variants have a very different age distribution than the other major categories. Some background to that: the mini- van version was initially called Ford Focus C-Max but from 2010 it is a separate model, not a Focus variant. This explains this phenomenon (the dataset contains a boolean feature for C-max, which is equivalent of Form being Minivan).
Clearly the relationship between age and (log) price, the older the car, the smaller the price the seller asks for it.
Electric cars are clearly amongst the youngest and priciest ones, while diesel cars in the lower age segments tend to be on the cheaper end compared to gasoline cars with the same age.
We already know that silver, gray and blue combis and hatchbacks are the most common variations, but from the plot on the right it seems that buyers are better off buying violet or gold, maybe turquoise cars.
Not surprisingly the youngest and least used cars have licenses with the longest validity, but also interesting to see, that for almost all age above 3-4 the validity dates are disperesed in roughly the same range.
Minivans seem to differ not only in age and price distribution, the visible relationship between these to features is also visibly different (the slope of the regressionline is slightly smaller).
A nice way to visualize the relationship between age, mileage and price is to use the colors for the years. We get a nice rainbow-like pattern showing that the price of cars with the same age tend to group around the same values.
Also the color ‘bands’ have a slight negative slopes showing that cars with the same age but higher mileage tend to have lower price.
I think most of the important relationships were already revealed in the bivariate section, but the multivariate plots mostly underpinned these findings.
Clearly the most important is the role of age in the pricing (the older the car lower the price the seller can ask for it). Also it became clear that minivans differ from the other frequent variants in many aspects (pricing and age, most notably).
One interesting thing is that even though the average annual mileage in itself does not show strong relationship with price, in conjunction with age it might still explain a smaller part of the variance in prices, as from cars with same age the one which has higher annual mileage tend to have lower price tag.
One new observation is that among the younger cars the different type of fuels might have an impact on prices.
I created a linear modell from some of the features of the dataset to model the pricing of the cars. The target variable is the log10 of Price.EUR.
The R-squared value is slightly over 90, which leaves some room for improvement, still, it shows that the above mentioned variables do have significant impact in determining the price of the cars.
The most important variable is Age.year, but I also included the log transformed Mileage.year and also total Mileage.km, and added Horsepower and some dummy variables of form variety, fuel, air-conditioning and license validity.
One of the limitations I encountered was that adding color and polish into the model tends to improve R-squared but with the trade-off of losing few hundred observations due to missing color information. This improvement comes at a great cost and threatens with overfitting (creating a model that works perfectly on a certain small sample, but loses predictive power on bigger populations).
Also I assume that extracting more data from the listings (eg. text fields and descriptions) might help to improve the model with additional data that explains a little more of the variance of the prices.
##
## Call:
## lm(formula = log10(undamaged_used_cars$Price.EUR) ~ undamaged_used_cars$Age.year +
## log10(undamaged_used_cars$Mileage.year) + undamaged_used_cars$Mileage.km +
## factor(undamaged_used_cars$Fuel == "Diesel") + factor(undamaged_used_cars$Fuel ==
## "Gasoline") + factor(undamaged_used_cars$Form == "Minivan") +
## factor(undamaged_used_cars$Documents == "Valid, domestic") +
## factor(undamaged_used_cars$A.C.type == "Manual A/C") + undamaged_used_cars$Horsepower)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49218 -0.04004 0.00302 0.04688 0.46866
##
## Coefficients:
## Estimate
## (Intercept) 4.297e+00
## undamaged_used_cars$Age.year -5.052e-02
## log10(undamaged_used_cars$Mileage.year) -5.589e-02
## undamaged_used_cars$Mileage.km -6.392e-07
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE -6.400e-02
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE -4.912e-02
## factor(undamaged_used_cars$Form == "Minivan")TRUE 5.616e-02
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE 6.748e-02
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE -1.721e-02
## undamaged_used_cars$Horsepower 1.816e-03
## Std. Error
## (Intercept) 8.051e-02
## undamaged_used_cars$Age.year 9.964e-04
## log10(undamaged_used_cars$Mileage.year) 1.852e-02
## undamaged_used_cars$Mileage.km 5.590e-08
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE 1.798e-02
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE 1.792e-02
## factor(undamaged_used_cars$Form == "Minivan")TRUE 6.229e-03
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE 5.869e-03
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE 4.222e-03
## undamaged_used_cars$Horsepower 8.689e-05
## t value
## (Intercept) 53.373
## undamaged_used_cars$Age.year -50.697
## log10(undamaged_used_cars$Mileage.year) -3.018
## undamaged_used_cars$Mileage.km -11.434
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE -3.560
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE -2.740
## factor(undamaged_used_cars$Form == "Minivan")TRUE 9.016
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE 11.497
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE -4.076
## undamaged_used_cars$Horsepower 20.897
## Pr(>|t|)
## (Intercept) < 2e-16
## undamaged_used_cars$Age.year < 2e-16
## log10(undamaged_used_cars$Mileage.year) 0.002575
## undamaged_used_cars$Mileage.km < 2e-16
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE 0.000379
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE 0.006194
## factor(undamaged_used_cars$Form == "Minivan")TRUE < 2e-16
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE < 2e-16
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE 4.75e-05
## undamaged_used_cars$Horsepower < 2e-16
##
## (Intercept) ***
## undamaged_used_cars$Age.year ***
## log10(undamaged_used_cars$Mileage.year) **
## undamaged_used_cars$Mileage.km ***
## factor(undamaged_used_cars$Fuel == "Diesel")TRUE ***
## factor(undamaged_used_cars$Fuel == "Gasoline")TRUE **
## factor(undamaged_used_cars$Form == "Minivan")TRUE ***
## factor(undamaged_used_cars$Documents == "Valid, domestic")TRUE ***
## factor(undamaged_used_cars$A.C.type == "Manual A/C")TRUE ***
## undamaged_used_cars$Horsepower ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08195 on 2117 degrees of freedom
## (84 observations deleted due to missingness)
## Multiple R-squared: 0.906, Adjusted R-squared: 0.9056
## F-statistic: 2267 on 9 and 2117 DF, p-value: < 2.2e-16
I particularly like this plot because it packs two of the most important relationships in this analysis into one graph:
Age has a negative impact on price, the points are rather close to the regression line with negative slope.
In any age group the cars with lower average milage tend to have higher prices. The points above the regression line seem to be darker in general, but this gets even sharper in the lower half of the date range.
I like this plot because it clearly visualizes some differences between the main form varieties and fuel types. This plot helped me to recognize how different is the minivan from the others, which in turn helped me to learn the special history of the C-max variant (starting as a variant, later becoming a separate model).
2.This is the only variant where there are only diesel or gasoline fueled cars in the sample (no alternative fuels)
For the other models gasoline cars tend to have slightly (or even substantially) lower median price than diesel ones.
Finally a visual proof that the color of the car has no meaningful impact on the price the seller asks, neither if the car has one of the most frequent colors nor if it’s one of the exotic variants. All the plots show very similar distribution, except where we have too few observation, and exepct the minivans, that we already saw somewhat differs from the other categories.
No surprise that none of the variables in connection with the color was included in the linear model above.
Aside from this I like how the facet_grid() let’s me visualize at least 4 variables, not counting the aesthetics of the points, like size, color and shape.
It was exciting to explore a very recent, up-to-date, real-life dataset that I personally collected. The dataset also reinforced some of my initial assumptions (eg. the age must be an important factor) but also falsified a few (like the major importance of mileage in the pricing) and answered my questions, like does the color of the car matter in the price.
The custom dataset was not as clean and well balanced between different features as many of the usual learning datasets (eg. some of the features turned out to be concentrated around one or few values). Therefore I had to go back to the cleaning script several times, when something popped out in first part of this analysis.
I plan to work on with this dataset, and also I would like to broaden it, in two ways:
Get more data of each listing, eg. qualitative technical and convenience features from text fields, geolocation (where is the seller/car located), etc. These might help to improve the predicative power of the model.
Include some other popoular car brands and models in the sample to have a significantly larger dataset, and to see how different (or similar) are the factors and their weights in determining the price of each car.
I also believe that by analysing the longer descriptions that the sellers providing with NLP tools and machine learning might reveal some more important patterns, and in conjunction with the other data it could even reveal some clues about the genuineness of the listings, and help filter out scams and fraudulent sellers.